Austin Animal Shelter¶

Animal welfare is something that is really close to my heart so I am really excited to go through this dataset for analysis. This data was posted on Kaggle but I did not discover the dataset until after the competition expired. The goal posted by Austin Animal Shelter for this data was to predict the outcome of each animal.

In [187]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import normalize
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix

import category_encoders as ce


import xgboost as xgb
import seaborn as sns

%matplotlib inline

Load Data¶

In [188]:

df = pd.read_csv('train.csv')
orig_df = pd.read_csv('train.csv')

In [189]:

df.shape

Out[189]:

(26729, 10)

In [190]:

df.head()

Out[190]:

	AnimalID	Name	DateTime	OutcomeType	OutcomeSubtype	AnimalType	SexuponOutcome	AgeuponOutcome	Breed	Color
0	A671945	Hambone	2014-02-12 18:22:00	Return_to_owner	NaN	Dog	Neutered Male	1 year	Shetland Sheepdog Mix	Brown/White
1	A656520	Emily	2013-10-13 12:44:00	Euthanasia	Suffering	Cat	Spayed Female	1 year	Domestic Shorthair Mix	Cream Tabby
2	A686464	Pearce	2015-01-31 12:28:00	Adoption	Foster	Dog	Neutered Male	2 years	Pit Bull Mix	Blue/White
3	A683430	NaN	2014-07-11 19:09:00	Transfer	Partner	Cat	Intact Male	3 weeks	Domestic Shorthair Mix	Blue Cream
4	A667013	NaN	2013-11-15 12:52:00	Transfer	Partner	Dog	Neutered Male	2 years	Lhasa Apso/Miniature Poodle	Tan

In [191]:

df.columns

Out[191]:

Index([u'AnimalID', u'Name', u'DateTime', u'OutcomeType', u'OutcomeSubtype',
       u'AnimalType', u'SexuponOutcome', u'AgeuponOutcome', u'Breed',
       u'Color'],
      dtype='object')

Convert age column to age in years¶

The age column contains strings and the unit of age is not consistent. So, we will convert the age column into age in years. There are some entries with no age value. We will first fill in those null values with -99, and then replace them with the average age.

In [192]:

df['AgeuponOutcome'].unique()

Out[192]:

array(['1 year', '2 years', '3 weeks', '1 month', '5 months', '4 years',
       '3 months', '2 weeks', '2 months', '10 months', '6 months',
       '5 years', '7 years', '3 years', '4 months', '12 years', '9 years',
       '6 years', '1 weeks', '11 years', '4 weeks', '7 months', '8 years',
       '11 months', '4 days', '9 months', '8 months', '15 years',
       '10 years', '1 week', '0 years', '14 years', '3 days', '6 days',
       '5 days', '5 weeks', '2 days', '16 years', '1 day', '13 years', nan,
       '17 years', '18 years', '19 years', '20 years'], dtype=object)

In [193]:

Ages = df['AgeuponOutcome'].astype(str)
y = [float(age.split()[0]) if 'year' in age
     else float(age.split()[0])/52. if 'week' in age
     else float(age.split()[0])/12. if 'month' in age
     else float(age.split()[0])/352. if 'week' in age
     else -99.0
     for age in Ages]

In [194]:

y[y == -99] = np.mean(y[y != -99])

Here, we replace the missing ages with the average age of the animals¶

In [195]:

df['AgeuponOutcome'] = pd.DataFrame({'AgeuponOutcome': y})

In [196]:

df.head()

Out[196]:

	AnimalID	Name	DateTime	OutcomeType	OutcomeSubtype	AnimalType	SexuponOutcome	AgeuponOutcome	Breed	Color
0	A671945	Hambone	2014-02-12 18:22:00	Return_to_owner	NaN	Dog	Neutered Male	1.000000	Shetland Sheepdog Mix	Brown/White
1	A656520	Emily	2013-10-13 12:44:00	Euthanasia	Suffering	Cat	Spayed Female	1.000000	Domestic Shorthair Mix	Cream Tabby
2	A686464	Pearce	2015-01-31 12:28:00	Adoption	Foster	Dog	Neutered Male	2.000000	Pit Bull Mix	Blue/White
3	A683430	NaN	2014-07-11 19:09:00	Transfer	Partner	Cat	Intact Male	0.057692	Domestic Shorthair Mix	Blue Cream
4	A667013	NaN	2013-11-15 12:52:00	Transfer	Partner	Dog	Neutered Male	2.000000	Lhasa Apso/Miniature Poodle	Tan

Basic Stats for Dogs vs Cats¶

Average age upon Outcome¶

In [197]:

df[df['AnimalType'] == 'Dog']['AgeuponOutcome'].mean()

Out[197]:

2.2319505758749227

In [198]:

df[df['AnimalType'] == 'Cat']['AgeuponOutcome'].mean()

Out[198]:

-1.6021397843519833

Number of dogs vs cats¶

In [199]:

len(df[df['AnimalType'] == 'Dog'])

Out[199]:

In [200]:

len(df[df['AnimalType'] == 'Cat'])

Out[200]:

Summary on basic stats: The average age for a dog (2.75 yrs) is greater than the average age of cats (1.36 yrs) upon the outcome. The shelter also has sheltered more dogs than cats over the course of 3 years.

Create columns for male vs female, fixed vs intact¶

In [201]:

df['SexuponOutcome'].unique()

Out[201]:

array(['Neutered Male', 'Spayed Female', 'Intact Male', 'Intact Female',
       'Unknown', nan], dtype=object)

In [202]:

df['SexuponOutcome'] = df['SexuponOutcome'].replace(np.nan,'Unknown')

In [203]:

sex = ['male' if 'Male' in df['SexuponOutcome'].iloc[i] 
       else 'female' if 'Female' in df['SexuponOutcome'].iloc[i] 
       else 'unknown'
       for i in range(len(df['SexuponOutcome']))]

spayed_neutered = ['fixed' if any(x in df['SexuponOutcome'].iloc[i] for x in ['Neutered','Spayed'])
                    else 'not_fixed' if 'Intact' in df['SexuponOutcome'].iloc[i]
                    else 'unknown_fixed'
                    for i in range(len(df['SexuponOutcome']))]

In [204]:

set(spayed_neutered)

Out[204]:

{'fixed', 'not_fixed', 'unknown_fixed'}

In [205]:

set(sex)

Out[205]:

{'female', 'male', 'unknown'}

In [206]:

df['Fixed'] = pd.DataFrame({'fixed':spayed_neutered})
df['Sex'] = pd.DataFrame({'Sex': sex})
df = df.drop('SexuponOutcome',axis = 1)

Simplify the Breed Feature¶

In [207]:

df.head()

Out[207]:

	AnimalID	Name	DateTime	OutcomeType	OutcomeSubtype	AnimalType	AgeuponOutcome	Breed	Color	Fixed	Sex
0	A671945	Hambone	2014-02-12 18:22:00	Return_to_owner	NaN	Dog	1.000000	Shetland Sheepdog Mix	Brown/White	fixed	male
1	A656520	Emily	2013-10-13 12:44:00	Euthanasia	Suffering	Cat	1.000000	Domestic Shorthair Mix	Cream Tabby	fixed	female
2	A686464	Pearce	2015-01-31 12:28:00	Adoption	Foster	Dog	2.000000	Pit Bull Mix	Blue/White	fixed	male
3	A683430	NaN	2014-07-11 19:09:00	Transfer	Partner	Cat	0.057692	Domestic Shorthair Mix	Blue Cream	not_fixed	male
4	A667013	NaN	2013-11-15 12:52:00	Transfer	Partner	Dog	2.000000	Lhasa Apso/Miniature Poodle	Tan	fixed	male

In [208]:

len(df['Breed'].unique())

Out[208]:

Currently, there are 1380 different breed classifications among cats and dogs¶

In [209]:

df['Breed'].value_counts()[0:10]

Out[209]:

Domestic Shorthair Mix       8810
Pit Bull Mix                 1906
Chihuahua Shorthair Mix      1766
Labrador Retriever Mix       1363
Domestic Medium Hair Mix      839
German Shepherd Mix           575
Domestic Longhair Mix         520
Siamese Mix                   389
Australian Cattle Dog Mix     367
Dachshund Mix                 318
Name: Breed, dtype: int64

In [210]:

df[df['AnimalType'] == 'Cat']['Breed'].value_counts()[0:10]

Out[210]:

Domestic Shorthair Mix      8810
Domestic Medium Hair Mix     839
Domestic Longhair Mix        520
Siamese Mix                  389
Domestic Shorthair           143
Snowshoe Mix                  75
Maine Coon Mix                44
Manx Mix                      44
Domestic Medium Hair          42
Russian Blue Mix              33
Name: Breed, dtype: int64

In [211]:

len(df[df['AnimalType'] == 'Cat']['Breed'].unique())

Out[211]:

In [212]:

len(df[df['AnimalType'] == 'Dog']['Breed'].unique())

Out[212]:

Domestic Shorthair Mix seems to be a classification for cats only¶

Cat breeds seem to be somewhat limited with 60 different breeds while dogs have 1320 different classifications. We will only simplify dog breeds.

In [213]:

df[df['AnimalType'] == 'Dog']['Breed'].value_counts()[0:20]

Out[213]:

Pit Bull Mix                 1906
Chihuahua Shorthair Mix      1766
Labrador Retriever Mix       1363
German Shepherd Mix           575
Australian Cattle Dog Mix     367
Dachshund Mix                 318
Boxer Mix                     245
Miniature Poodle Mix          233
Border Collie Mix             229
Australian Shepherd Mix       163
Rat Terrier Mix               157
Catahoula Mix                 157
Jack Russell Terrier Mix      146
Yorkshire Terrier Mix         143
Chihuahua Longhair Mix        142
Siberian Husky Mix            138
Miniature Schnauzer Mix       136
Beagle Mix                    124
Rottweiler Mix                113
American Bulldog Mix          109
Name: Breed, dtype: int64

Let's remove the "Mix" from the dog breeds and also take only the first breed from classifications in the form "breed 1 / breed 2", assuming that the first breed listed is the dominant breed¶

In [214]:

import re
breeds = list(df[df['AnimalType'] == 'Dog']['Breed'])
dog_breeds = [re.sub(' Mix', '', dog) if 'Mix' in dog else dog.split("/")[0] for dog in breeds]
# for i in range(len(dog_breeds)):
#     if ('Mix' in dog_breeds[i]):
#         dog_breeds[i] = re.sub(' Mix', '', dog_breeds[i])
#     else:
#         dog_breeds[i] = dog_breeds[i].split("/")[0]

In [215]:

len(set(dog_breeds))

Out[215]:

In [216]:

dog_breeds_df = pd.DataFrame({'Breed':dog_breeds})

We were able to narrow down the different classes of dog breeds from 1320 to 188¶

In [217]:

dog_breeds_df['Breed'].value_counts()[0:10]

Out[217]:

Chihuahua Shorthair      2145
Pit Bull                 2113
Labrador Retriever       1915
German Shepherd           826
Australian Cattle Dog     511
Dachshund                 510
Boxer                     360
Border Collie             334
Miniature Poodle          310
Australian Shepherd       229
Name: Breed, dtype: int64

Now let's replace all the dog breed classifications from the original dataframe with the new breed classifications¶

In [218]:

df.ix[df.AnimalType == 'Dog', 'Breed'] = dog_breeds

In [219]:

len(df['Breed'].unique())

Out[219]:

In [220]:

x = df['Breed'].value_counts().index.tolist()
y = df['Breed'].value_counts().tolist()

In [221]:

common_breed = df['Breed'].value_counts().index.tolist()[0:10]
pd.crosstab(df[df['Breed'].isin(common_breed)]['Breed'],df['OutcomeType'])

Out[221]:

OutcomeType	Adoption	Died	Euthanasia	Return_to_owner	Transfer
Breed
Australian Cattle Dog	243	1	20	126	121
Chihuahua Shorthair	966	13	90	478	598
Dachshund	243	0	15	95	157
Domestic Longhair Mix	199	12	56	48	205
Domestic Medium Hair Mix	357	11	62	26	383
Domestic Shorthair Mix	3273	112	535	352	4538
German Shepherd	372	3	37	237	177
Labrador Retriever	857	4	82	498	474
Pit Bull	625	8	287	678	515
Siamese Mix	173	5	30	27	154

Simplify Dog Color Column¶

The color column has 366 different colors. We will simplify this column similar to the breed column, taking the first color as the dominant color.¶

In [222]:

len(df['Color'].unique())

Out[222]:

In [223]:

len(df[df['AnimalType'] == 'Dog']['Color'].unique())

Out[223]:

In [224]:

df[df['AnimalType'] == 'Dog']['Color'].value_counts()[0:15]

Out[224]:

Black/White            1730
Brown/White             882
Black                   851
White                   806
Tan/White               773
Tricolor                751
Black/Tan               672
Brown                   637
Tan                     627
White/Brown             562
White/Black             475
Brown Brindle/White     450
Black/Brown             435
Blue/White              414
White/Tan               389
Name: Color, dtype: int64

In [225]:

df[df['AnimalType'] == 'Cat']['Color'].value_counts()[0:15]

Out[225]:

Brown Tabby           1635
Black                 1441
Black/White           1094
Brown Tabby/White      940
Orange Tabby           841
Tortie                 530
Calico                 517
Orange Tabby/White     455
Blue Tabby             433
Blue                   385
Torbie                 335
Blue/White             288
Blue Tabby/White       241
Cream Tabby            198
White/Black            168
Name: Color, dtype: int64

In [226]:

# Take the first color
colors = df['Color']
color = [c.split("/")[0] for c in df['Color']]

In [227]:

len(pd.DataFrame({'Color':color})['Color'].unique())

Out[227]:

In [228]:

pd.DataFrame({'Color':color})['Color'].value_counts()[0:15]

Out[228]:

Black            6422
White            3344
Brown Tabby      2592
Brown            1951
Tan              1674
Orange Tabby     1299
Blue             1199
Tricolor          800
Red               779
Brown Brindle     699
Blue Tabby        678
Tortie            580
Calico            552
Chocolate         448
Torbie            398
Name: Color, dtype: int64

Taking the first listed color as the dominant color, we reduced the number of possibilites for color definition from 366 to 57

In [229]:

df['Color'] = color

In [230]:

df.head()

Out[230]:

	AnimalID	Name	DateTime	OutcomeType	OutcomeSubtype	AnimalType	AgeuponOutcome	Breed	Color	Fixed	Sex
0	A671945	Hambone	2014-02-12 18:22:00	Return_to_owner	NaN	Dog	1.000000	Shetland Sheepdog	Brown	fixed	male
1	A656520	Emily	2013-10-13 12:44:00	Euthanasia	Suffering	Cat	1.000000	Domestic Shorthair Mix	Cream Tabby	fixed	female
2	A686464	Pearce	2015-01-31 12:28:00	Adoption	Foster	Dog	2.000000	Pit Bull	Blue	fixed	male
3	A683430	NaN	2014-07-11 19:09:00	Transfer	Partner	Cat	0.057692	Domestic Shorthair Mix	Blue Cream	not_fixed	male
4	A667013	NaN	2013-11-15 12:52:00	Transfer	Partner	Dog	2.000000	Lhasa Apso	Tan	fixed	male

Simplify the date and time column¶

In [231]:

df['DateTime'].head()

Out[231]:

0    2014-02-12 18:22:00
1    2013-10-13 12:44:00
2    2015-01-31 12:28:00
3    2014-07-11 19:09:00
4    2013-11-15 12:52:00
Name: DateTime, dtype: object

In [232]:

from datetime import datetime
d = df['DateTime'][0].split()[0]
dates = [df['DateTime'][i].split()[0] for i in range(len(df['DateTime']))]
date = [datetime.strptime(d,'%Y-%m-%d') for d in dates]

In [233]:

# season = ['Winter' if d.month % 11 <= 2
#          else 'Spring' if (d.month % 11 >= 3) & (d.month % 11 <= 5)
#          else 'Summer' if (d.month % 11 >= 6) & (d.month % 11 <= 8)
#          else 'Fall'
#          for d in date]

In [234]:

week = [1 if d.day <= 7
       else 2 if (d.day > 7) & (d.day <= 14)
       else 3 if (d.day > 14) & (d.day <= 21)
       else 4
       for d in date]

In [235]:

# df['Season'] = season
df['Week'] = week

In [236]:

df.head()

Out[236]:

	AnimalID	Name	DateTime	OutcomeType	OutcomeSubtype	AnimalType	AgeuponOutcome	Breed	Color	Fixed	Sex	Week
0	A671945	Hambone	2014-02-12 18:22:00	Return_to_owner	NaN	Dog	1.000000	Shetland Sheepdog	Brown	fixed	male	2
1	A656520	Emily	2013-10-13 12:44:00	Euthanasia	Suffering	Cat	1.000000	Domestic Shorthair Mix	Cream Tabby	fixed	female	2
2	A686464	Pearce	2015-01-31 12:28:00	Adoption	Foster	Dog	2.000000	Pit Bull	Blue	fixed	male	4
3	A683430	NaN	2014-07-11 19:09:00	Transfer	Partner	Cat	0.057692	Domestic Shorthair Mix	Blue Cream	not_fixed	male	2
4	A667013	NaN	2013-11-15 12:52:00	Transfer	Partner	Dog	2.000000	Lhasa Apso	Tan	fixed	male	3

Create Indicator Column whether the animal was given a name or not¶

In [237]:

df['Name'] = df['Name'].isnull() * 1.
df = df.rename(columns = {'Name':'NoName'})

Create column for indicator of Animal Type¶

In [238]:

df['AnimalType'].unique()

Out[238]:

array(['Dog', 'Cat'], dtype=object)

In [239]:

df['AnimalType'] = (df['AnimalType'] == 'Dog') * 1.

Create Dummy Variables for Columns with Categorical Variables¶

In [240]:

categorical_var = ['Color', 'Breed','Fixed', 'Sex']
binary = ce.binary.BinaryEncoder(cols = categorical_var)
binary.fit(df)
dat = binary.transform(df)

# dat = pd.concat([df.drop(categorical_var, axis = 1), pd.get_dummies(df[categorical_var])], axis = 1)

In [241]:

pd.set_option('display.max_columns', 100)
dat.head()

Out[241]:

	Color_0	Color_1	Color_2	Color_3	Color_4	Color_5	Breed_0	Breed_1	Breed_2	Breed_3	Breed_4	Breed_5	Breed_6	Breed_7	Fixed_0	Sex_1	AnimalID	NoName	DateTime	OutcomeType	OutcomeSubtype	AnimalType	AgeuponOutcome	Week
0	1	0	1	1	0	1	1	1	0	1	0	0	0	0	1	0	A671945	0.0	2014-02-12 18:22:00	Return_to_owner	NaN	1.0	1.000000	2
1	0	0	1	1	0	1	0	1	0	0	1	1	1	1	1	1	A656520	0.0	2013-10-13 12:44:00	Euthanasia	Suffering	0.0	1.000000	2
2	0	1	1	1	1	1	0	1	1	0	0	1	1	0	1	0	A686464	0.0	2015-01-31 12:28:00	Adoption	Foster	1.0	2.000000	4
3	1	1	0	0	1	0	0	1	0	0	1	1	1	1	0	0	A683430	1.0	2014-07-11 19:09:00	Transfer	Partner	0.0	0.057692	2
4	1	0	0	0	0	0	0	0	0	0	0	1	0	1	1	0	A667013	1.0	2013-11-15 12:52:00	Transfer	Partner	1.0	2.000000	3

Check dimensions of data frame to check dummy function¶

Create Labels¶

In [242]:

lab = dat['OutcomeType']

In [243]:

lab.value_counts()

Out[243]:

Adoption           10769
Transfer            9422
Return_to_owner     4786
Euthanasia          1555
Died                 197
Name: OutcomeType, dtype: int64

In [244]:

le = LabelEncoder()
labels = le.fit_transform(dat['OutcomeType'])

Drop columns not needed for models¶

In [245]:

dat.drop(['AnimalID','DateTime','OutcomeType','OutcomeSubtype'],axis = 1).dtypes.unique()

Out[245]:

array([dtype('int64'), dtype('float64')], dtype=object)

In [246]:

dat = dat.drop(['AnimalID','DateTime','OutcomeType','OutcomeSubtype', 'AnimalType'],axis = 1)

Create Training and Testing Sets¶

In [247]:

X_train, X_test, Y_train, Y_test = train_test_split(dat, labels, test_size = 0.25)

In [248]:

df.OutcomeType.value_counts()/float(df.shape[0])

Out[248]:

Adoption           0.402896
Transfer           0.352501
Return_to_owner    0.179056
Euthanasia         0.058177
Died               0.007370
Name: OutcomeType, dtype: float64

In [249]:

np.bincount(Y_train)/float(len(Y_train))

Out[249]:

array([ 0.40177592,  0.00768233,  0.05706874,  0.17863913,  0.35483388])

In [250]:

np.bincount(Y_test)/float(len(Y_test))

Out[250]:

array([ 0.40625468,  0.00643424,  0.06149933,  0.18030824,  0.34550352])

We now need to normalize the training and testing set.

In [251]:

X_train = normalize(X_train, axis = 1)
X_test = normalize(X_test, axis = 1)

Logistic Regression Grid Search and Cross Validation¶

In [252]:

log_model = LogisticRegression(penalty = 'l1')
params = {'C':[0.25, 0.5, 0.75, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5]}
gs_model = GridSearchCV(log_model, params)
gs_model.fit(X_train, Y_train)

Out[252]:

GridSearchCV(cv=None, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l1', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'C': [0.25, 0.5, 0.75, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [253]:

gs_model.best_params_

Out[253]:

{'C': 4.5}

In [254]:

C_val = gs_model.best_params_['C']

best_log_model = LogisticRegression(C = C_val, penalty = 'l1')
best_log_model.fit(X_train, Y_train)
best_log_model.score(X_test, Y_test)

Out[254]:

0.6304055065090528

In [255]:

log_loss(Y_test, best_log_model.predict_proba(X_test))

Out[255]:

0.92958350834823378

In [256]:

cm = confusion_matrix(Y_test, best_log_model.predict(X_test))

plt.matshow(cm)
plt.colorbar()
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')

print cm

le.classes_

[[2317    0    0  301   97]
 [   4    0    0    5   34]
 [  62    0    0  109  240]
 [ 574    0    0  504  127]
 [ 704    0    0  213 1392]]

Out[256]:

array(['Adoption', 'Died', 'Euthanasia', 'Return_to_owner', 'Transfer'], dtype=object)

Gaussian Naive Bayes¶

In [257]:

nb_model = GaussianNB()
nb_model.fit(X_train, Y_train)
nb_model.score(X_test, Y_test)

Out[257]:

0.43872512344755349

Gradient Boosting¶

Grid Search for Gradient Boosting¶

In [258]:

# params = {'n_estimators': [200, 300, 450], 'learning_rate':[0.1, 0.5],
#           'max_depth':[5, 6, 7, 8], 'max_features':['sqrt', .40], 'min_samples_split':[2, 20, 30, 50, 100]}
params = {'n_estimators':[200, 300], 'max_features':['sqrt', 'log2'], 'max_depth':[5, 7]}
grid_search_gb = GridSearchCV(GradientBoostingClassifier(), params, verbose = 1)
grid_search_gb.fit(X_train, Y_train)

Fitting 3 folds for each of 8 candidates, totalling 24 fits

[Parallel(n_jobs=1)]: Done  24 out of  24 | elapsed:  5.6min finished

Out[258]:

GridSearchCV(cv=None, error_score='raise',
       estimator=GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_split=1e-07, min_samples_leaf=1,
              min_samples_split=2, min_weight_fraction_leaf=0.0,
              n_estimators=100, presort='auto', random_state=None,
              subsample=1.0, verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'n_estimators': [200, 300], 'max_features': ['sqrt', 'log2'], 'max_depth': [5, 7]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=1)

In [259]:

grid_search_gb.best_params_

Out[259]:

{'max_depth': 5, 'max_features': 'sqrt', 'n_estimators': 200}

In [260]:

num_estimators = grid_search_gb.best_params_['n_estimators']
depth = grid_search_gb.best_params_['max_depth']
features = grid_search_gb.best_params_['max_features']

In [261]:

best_gb_model = GradientBoostingClassifier(n_estimators = num_estimators, learning_rate = 0.1, max_depth = depth)
best_gb_model.fit(X_train, Y_train)
best_gb_model.score(X_test, Y_test)

Out[261]:

0.63055513990722734

In [262]:

log_loss(Y_test, best_gb_model.predict_proba(X_test))

Out[262]:

0.89120449184557038

In [263]:

cm = confusion_matrix(Y_test, best_gb_model.predict(X_test))

plt.matshow(cm)
plt.colorbar()
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')

print cm

le.classes_

[[2244    3   12  289  167]
 [  10    0    0    2   31]
 [  57    1   62  102  189]
 [ 557    2   21  517  108]
 [ 637    8   53  220 1391]]

Out[263]:

array(['Adoption', 'Died', 'Euthanasia', 'Return_to_owner', 'Transfer'], dtype=object)

In [264]:

feat_imp = pd.DataFrame({'features':dat.columns.values,'values':best_gb_model.feature_importances_})

In [265]:

top_feat = feat_imp.sort('values').iloc[-10:]
top_feat

/home/truong/anaconda2/lib/python2.7/site-packages/ipykernel/__main__.py:1: FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)
  if __name__ == '__main__':

Out[265]:

	features	values
2	Color_2	0.042355
1	Color_1	0.044922
0	Color_0	0.047059
8	Breed_2	0.048157
17	Sex_1	0.049916
5	Color_5	0.051596
18	NoName	0.059181
14	Fixed_0	0.079279
20	Week	0.082495
19	AgeuponOutcome	0.120870

In [266]:

plt.figure(figsize = (20,10))
plt.title('Top 10 Features')
plt.xticks(fontsize = 15)
sns.barplot(top_feat['features'], 100.0 * top_feat['values']/np.max(best_gb_model.feature_importances_))

Out[266]:

XgBoost¶

In [267]:

params = {'n_estimators':[200, 300], 'max_depth':[4, 5, 6], 'subsample':[1, 0.8]}
gs_xgb = GridSearchCV(xgb.XGBClassifier(), params, verbose = 1)
gs_xgb.fit(X_train, Y_train)

Fitting 3 folds for each of 12 candidates, totalling 36 fits

[Parallel(n_jobs=1)]: Done  36 out of  36 | elapsed:  2.1min finished

Out[267]:

GridSearchCV(cv=None, error_score='raise',
       estimator=XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
       gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
       objective='binary:logistic', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=0, silent=True, subsample=1),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'n_estimators': [200, 300], 'subsample': [1, 0.8], 'max_depth': [4, 5, 6]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=1)

In [268]:

gs_xgb.best_params_

Out[268]:

{'max_depth': 4, 'n_estimators': 200, 'subsample': 1}

In [269]:

depth = gs_xgb.best_params_['max_depth']
estimators = gs_xgb.best_params_['n_estimators']
sub_sample = gs_xgb.best_params_['subsample']

In [278]:

xgb_model = xgb.XGBClassifier(n_estimators = estimators, max_depth = depth, subsample = sub_sample,
                             objective = 'multi:softmax')
xgb_model.fit(X_train, Y_train, eval_metric = 'merror', eval_set = [(X_test, Y_test)],
              early_stopping_rounds = 150,verbose = False)
xgb_model.score(X_test, Y_test)

Out[278]:

0.64132874457578937

In [279]:

log_loss(Y_test, xgb_model.predict_proba(X_test))

Out[279]:

0.86109146072771603

Majority Vote¶

In [275]:

mv_model = VotingClassifier([('nb',nb_model), ('xgb',best_gb_model), ('log',best_log_model)], voting = 'soft')
mv_model.fit(X_train, Y_train)
mv_model.score(X_test, Y_test)

Out[275]:

0.59120155618734105

In [276]:

log_loss(Y_test, mv_model.predict_proba(X_test))

Out[276]:

0.99900816780210555

ZeroR Baseline¶

In [274]:

lab.value_counts()[0]/float(len(labels))

Out[274]:

0.40289573122825395

Using the zero rule baseline, we would get a 40% accuracy by "guessing" all animals will be adopted. After testing numerous models, the best result is 64% prediction accuracy.

In [ ]: